This project uses data from the 1974 Motor Trend US magazine (comprising fuel consumption and 10 aspects of automobile design and performance for 32 automobiles) to:
explore the relationship between various variables and fuel consumption
compare the fuel efficiency of automatic versus manual transmission cars
Environment
if (!require("pacman")) install.packages("pacman"); library(pacman)
pacman::p_load(dplyr, corrplot, ggplot2, ggthemr, plotly)
load data
data("mtcars")
view first few rows
head(mtcars, 4)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
The data has 32 observations of the following variables:
check for missingness
sum(is.na(mtcars))
## [1] 0
transforming numeric variables to factors
mydata <- mtcars %>% mutate_at(c("cyl", "vs", "am", "gear", "carb"), as.factor)
Fig 1: Correlation Plot
corrplot.mixed(cor(mtcars), order="hclust", tl.col="black")
Observations:
Fuel consumption (\(mpg\)) is highly correlated with automobile weight (\(wt\)), horsepower (\(hp\)), number of cylinders (\(cyl\)) and displacement (\(disp\)). These are potential predictors.
These potential predictors are themselves correlated. The best model will not contain all of the potential predictors.
Transmission type (\(am\)) is moderately correlated with the rear axel ratio (\(drat\)) and number of gears (\(gear\)). So, we might need to eliminate these variables to limit variance inflation due to collinearity.
Fig 2: Box plot: Fuel by Transmission
ggplot(mydata, aes(x=am, y=mpg, fill = am)) +
geom_boxplot() +
labs(x = "Type", y = "Miles per Gallon") +
scale_fill_discrete(labels = c("Automatic", "Manual"))
Observation: Manual cars in the sample have better fuel efficiency i.e. travel farther on comparable fuel than automatic cars.
how much farther?
auto <- mydata[mydata$am == 0,];man <- mydata[mydata$am == 1,]
mean(man$mpg) - mean(auto$mpg)
## [1] 7.244939
is this difference unique to our sample?
t.test(man$mpg, auto$mpg, alternative = c("two.sided"))
##
## Welch Two Sample t-test
##
## data: man$mpg and auto$mpg
## t = 3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 3.209684 11.280194
## sample estimates:
## mean of x mean of y
## 24.39231 17.14737
Observations:
Everything else equal, manual cars probably have a fuel efficiency advantage over automatic cars
However, ‘everything else’ need not be equal. As observed from \(Fig. 1\), factors such as weight, horsepower and number of cylinders may better explain differences in fuel consumption.
Fig 3: Box plot: Fuel by Transmission|Number of Cylinders
ggthemr("dust")
bp <- ggplot(mydata, aes(x=am, y=mpg, fill = am)) +
geom_boxplot() +
facet_wrap(. ~ cyl, labeller = "label_both")
ggplotly(bp)
Observation: As the number of cylinder increases, the difference in fuel consumption between automatic and manual transmission cars shrinks.
The goal is to gauge the effect of transmission type on fuel consumption after accounting for other predictors. And to investigate the best overall predictors of fuel consumption.
The model with only \(am\) as a predictor is the baseline.
baseline <- lm(mpg ~ am, data = mydata)
summary(baseline)
##
## Call:
## lm(formula = mpg ~ am, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.3923 -3.0923 -0.2974 3.2439 9.5077
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.147 1.125 15.247 1.13e-15 ***
## am1 7.245 1.764 4.106 0.000285 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared: 0.3598, Adjusted R-squared: 0.3385
## F-statistic: 16.86 on 1 and 30 DF, p-value: 0.000285
Observations:
In this model, transmission type is a significant predictor but it explains only 34% of the variance in fuel consumption
If you knew nothing else about a car, you may sometimes guess correctly that a manual car will have better fuel efficiency than an automatic car, but you would almost certainly be wrong about the magnitude of the difference.
Variable Selection
We use stepwise regression as a preliminary way of identifying important variables.
full <- lm(mpg ~., mydata)
select <- MASS::stepAIC(full, direction = "both", trace = FALSE)
The stepwise method identifies \(cyl\), \(hp\), \(wt\) and \(am\) as potential predictors.
Model Building
summary(select)$adj.r.squared
## [1] 0.8400875
A model with these variables is much more informative (50% more explanatory power) but we still need to decide what combination of these predictors yields the best model.
To do so I fit a (partial) second order model and eliminate insignificant variables. This yields:
chosen <- lm(mpg ~ hp + wt + I(hp^2) + I(wt^2), data = mydata)
summary(chosen)
##
## Call:
## lm(formula = mpg ~ hp + wt + I(hp^2) + I(wt^2), data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8849 -1.8165 -0.3922 1.3499 4.5807
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.945e+01 3.521e+00 14.044 6.27e-14 ***
## hp -9.428e-02 3.193e-02 -2.952 0.006456 **
## wt -9.220e+00 2.270e+00 -4.062 0.000375 ***
## I(hp^2) 1.743e-04 8.073e-05 2.159 0.039879 *
## I(wt^2) 8.500e-01 3.005e-01 2.829 0.008700 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.135 on 27 degrees of freedom
## Multiple R-squared: 0.8907, Adjusted R-squared: 0.8745
## F-statistic: 55.02 on 4 and 27 DF, p-value: 1.363e-12
This model has fairly high explanatory power (87.5%) with few, all-significant predictors.
NB Adding transmission type does not improve the model’s R-squared, in fact the model is penalized for having an extraneous variable.
summary(lm(mpg ~ am + hp + wt + I(hp^2) + I(wt^2), data = mydata))$adj.r.squared
## [1] 0.8699158
This implies that knowing a car’s weight and horsepower yields the most information about its fuel consumption.
Before proceeding, we check that the assumptions for linear regression aren’t grossly violated.
par(mfrow=c(2,2))
plot(chosen)
par(mfrow=c(1,1))
Zero Mean and Equal Variance of Errors
The Residuals vs Fitted shows residuals with a roughly zero mean zero and more or less equal spread, suggesting an unbiased model.
Normality of Errors
The points in the top-right plot seem to have deviations from normality (they do not lie on the diagonal line). We perform a secondary check for normality:
shapiro.test(summary(chosen)$resid)
##
## Shapiro-Wilk normality test
##
## data: summary(chosen)$resid
## W = 0.94267, p-value = 0.08918
The null hypothesis of this test is that the errors are normally distributed. With a p-value of 0.09, we fail to reject the null i.e. we find no evidence that normality is violated.
Influential Outliers
The Residuals vs Leverage indicate the possibility of leverage points. To gauge influence, we look at the variables with Cook’s Distance > 4/n
which(cooks.distance(chosen) > 4/nrow(mydata))
## 16 17 18 20
## 16 17 18 20
However, these values do not appear to have been recorded in error and deleting them doesn’t significantly improve the model. So we stick to the chosen model.
chosen
##
## Call:
## lm(formula = mpg ~ hp + wt + I(hp^2) + I(wt^2), data = mydata)
##
## Coefficients:
## (Intercept) hp wt I(hp^2) I(wt^2)
## 49.4530494 -0.0942795 -9.2203220 0.0001743 0.8500054
Final Observations:
All things equal, a manual car will have better fuel efficiency than an automatic one.
Given the information we have, the best predictor of fuel efficiency is weight.
A car’s horsepower, displacement and number of cylinders provide additional information. But seeing as they are all highly correlated, it is more or less the same information. Horsepower is just slightly more informative.
Both weight and horsepower affect fuel efficiency in the opposite direction i.e. heavier cars and those with high horsepower consume more fuel.
__________________________________________________________END__________________________________________________________